Information Retrieval Using Robust Natural Language Processing
Abstract
We developed a fully automated information retrieval system which uses advanced natural language processing techniques to enhance the effectiveness of traditional keyword-based document retrieval. In early experiments with the standard CACM-3204 collection of abstracts, the augmented system displayed capabilities that made it clearly superior to the purely statistical base system.

1. OVERALL DESIGN

Our information retrieval system consists of a traditional statistical backbone (Harman and Candela, 1989) augmented with various natural language processing components that assist the system in database processing (stemming, indexing, word and phrase clustering, selectional restrictions) and translate a user's information request into an effective query. This design is a careful compromise between purely statistical non-linguistic approaches and those requiring rather accomplished (and expensive) semantic analysis of data, often referred to as 'conceptual retrieval'. Conceptual retrieval systems, though quite effective, are not yet mature enough to be considered in serious information retrieval applications, the major problems being their extreme inefficiency and the need for manual encoding of domain knowledge (Mauldin, 1991).

In our system the database text is first processed with a fast syntactic parser. Subsequently, certain types of phrases are extracted from the parse trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and the words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.), after which they are used to transform the user's request into a search query. The user's natural language request is also parsed, and all indexing terms occurring in it are identified.
Next, certain highly ambiguous (usually single-word) terms are dropped, provided that they also occur as elements of some compound terms. For example, "natural" is deleted from a query already containing "natural language" because "natural" occurs in many unrelated contexts: "natural number", "natural logarithm", "natural approach", etc. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations. For example, "fortran" is added to a query containing the compound term "program language" via a specification link. After the final query is constructed, the database search follows, and a ranked list of documents is returned. It should be noted that all the processing steps, both those performed by the backbone system and those performed by the natural language processing components, are fully automated, and no human intervention or manual encoding is required.

2. FAST PARSING WITH TTP

TTP (Tagged Text Parser) is based on the Linguistic String Grammar developed by Sager (1981). Written in Quintus Prolog, the parser currently encompasses more than 400 grammar productions. It produces regularized parse tree representations for each sentence that reflect the sentence's logical structure. The parser is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the face of ill-formed input or under severe time pressure. In recent experiments with approximately 6 million words of English text,[1] the parser's speed averaged between 0.45 and 0.5 seconds per sentence, or up to 2600 words per minute, on a 21 MIPS SparcStation ELC. Some details of the parser are discussed below.[2]

TTP is a full-grammar parser, and initially it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence.
[1] These include CACM-3204, MUC-3, and a selection of nearly 6,000 technical articles extracted from the Computer Library database (a Ziff Communications Inc. CD-ROM). [2] A complete description can be found in (Strzalkowski, 1991).

If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode, in which it will try to "fit" the parse. While in the skip-and-fit mode, the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of input in order to restart processing at the next unattempted constituent. In other words, the parser will favor reduction over backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions. The fragments skipped in the first pass are not thrown out; instead, they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses, and then attaches the recovered material to the main parse structure. As an illustration, consider the following sentence taken from the CACM-3204 corpus:

The method is illustrated by the automatic construction of both recursive and iterative programs operating on natural numbers, lists, and trees. In order to construct a program satisfying certain specifications a theorem induced by those specifications is proved, and the desired program is extracted from the proof.

The fragment "in order to construct a program satisfying certain specifications" is likely to cause additional complications in parsing this lengthy string, and the parser may be better off ignoring this fragment altogether. To do so successfully, the parser must close the currently open constituent (i.e., reduce "a program satisfying certain specifications" to NP), and possibly a few of its parent constituents, removing the corresponding productions from further consideration, until an appropriate production is reactivated.
In this case, TTP may force the following reductions: SI --> to V NP; SA --> SI; S --> NP V NP SA, until the production S --> S and S is reached. Next, the parser skips input to find "and", and resumes normal processing. As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical-level ambiguity must be removed from the input text prior to parsing. We achieve this using a stochastic part-of-speech tagger[3] to preprocess the text.

3. WORD SUFFIX TRIMMER

Word stemming has been an effective way of improving document recall, since it reduces words to their common morphological root, thus allowing more successful matches. On the other hand, stemming tends to decrease retrieval precision if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trimmer.[4] The suffix trimmer performs essentially two tasks: (1) it reduces inflected word forms to their root forms as specified in the dictionary, and (2) it converts nominalized verb forms (e.g., "implementation", "storage") to the root forms of the corresponding verbs (i.e., "implement", "store"). This is accomplished by removing a standard suffix, e.g., "stor+age", replacing it with a standard root ending ("+e"), and checking the newly created word against the dictionary, i.e., we check whether the original root ("storage") is defined using the new root ("store"). This allows reducing "diversion" to "diverse" while preventing "version" from being replaced by "verse". Experiments with the CACM-3204 collection show an improvement in retrieval precision of 6% to 8% over the base system equipped with a standard morphological stemmer (the SMART stemmer).

[3] Courtesy of Bolt Beranek and Newman. [4] We use the Oxford Advanced Learner's Dictionary (OALD) MRD.
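The dictionary check behind the suffix trimmer can be sketched as follows. This is a minimal illustration, not the system's actual code: the suffix rules, the tiny dictionary, and the name `trim_suffix` are hypothetical stand-ins for the OALD machine-readable dictionary and the real trimmer.

```python
# Illustrative sketch of a dictionary-assisted suffix trimmer.
# SUFFIX_RULES and DICTIONARY are hypothetical stand-ins for the
# OALD MRD data described in the paper.

# (suffix, replacement ending) rules, e.g. "stor+age" -> "stor+e"
SUFFIX_RULES = [("age", "e"), ("ation", ""), ("sion", "se")]

# word -> set of roots used in its dictionary definition
DICTIONARY = {
    "storage": {"store"},            # "storage" is defined using "store"
    "implementation": {"implement"},
    "diversion": {"diverse"},
    "version": {"account", "form"},  # NOT defined using "verse"
    "store": set(), "implement": set(), "diverse": set(), "verse": set(),
}

def trim_suffix(word: str) -> str:
    """Reduce a nominalized form to its root, but only when the
    dictionary confirms the original word is defined via that root."""
    for suffix, ending in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + ending
            # accept only if the candidate is a dictionary word AND
            # appears in the definition of the original word
            if candidate in DICTIONARY and candidate in DICTIONARY.get(word, set()):
                return candidate
    return word
```

With these stand-in entries the sketch reproduces the paper's examples: "storage" and "diversion" are reduced, while "version" survives because "verse" does not occur in its definition.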
4. HEAD-MODIFIER STRUCTURES

Syntactic phrases extracted from TTP parse trees are head-modifier pairs: from simple word pairs to complex nested structures. The head in such a pair is the central element of a phrase (verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head.[5] For example, the phrase "fast algorithm for parsing context-free languages" yields the following pairs: algorithm+fast, algorithm+parse, parse+language, language+context_free. The following types of pairs were considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair [retrieve,information] is extracted from any of the following fragments: "information retrieval system"; "retrieval of information from databases"; and "information that can be retrieved by a user-controlled interactive search process".[6] An example is shown in the appendix.[7]

5. TERM CORRELATIONS FROM TEXT

Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. In order to determine whether such pairs signify any important association between terms, we calculate the value of the

[5] In the experiments reported here we extracted head-modifier word pairs only. The CACM collection is too small to warrant generation of larger compounds, because of their low frequencies. To deal with nominal compounds we use frequency information about the pairs generated from the entire corpus to form preferences in ambiguous situations, such as "natural language processing" vs. "dynamic information processing".
[7] Note that working with the parsed text ensures a high degree of precision in capturing the meaningful phrases, which is especially evident when compared with the results usually obtained from either unprocessed or only partially processed text (Lewis and Croft, 1990).
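The four pair types of Section 4 can be illustrated with a short sketch. The flat (head, modifiers) representation and the function names here are hypothetical simplifications assumed for illustration; TTP's actual regularized parse trees are richer structures.

```python
# Sketch of head-modifier pair extraction over a hypothetical,
# pre-parsed phrase representation (a simplification of TTP output).

def noun_phrase_pairs(head, modifiers):
    """Types (1)/(2): head noun paired with each left/right adjunct head."""
    return [(head, mod) for mod in modifiers]

def clause_pairs(subject_head, verb, object_head):
    """Type (3): verb + object head; type (4): verb + subject head.
    Pair order (head first) is an assumption of this sketch."""
    pairs = []
    if object_head:
        pairs.append((verb, object_head))
    if subject_head:
        pairs.append((verb, subject_head))
    return pairs

# "fast algorithm for parsing context-free languages"
pairs = (
    noun_phrase_pairs("algorithm", ["fast", "parse"])  # algorithm+fast, algorithm+parse
    + noun_phrase_pairs("language", ["context_free"])  # language+context_free
    + clause_pairs(None, "parse", "language")          # parse+language
)
print(pairs)
```

Run on the paper's example phrase, the sketch yields the same four pairs listed in Section 4: algorithm+fast, algorithm+parse, language+context_free, and parse+language.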
Similar Resources
Noun-Phrase Analysis in Unrestricted Text for Information Retrieval
Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe...
A robust speech understanding system using conceptual relational grammar
We describe a robust speech understanding system based on our newly developed approach to spoken language processing. We show that a robust NLU system can be rapidly developed using a relatively simple speech recognizer to provide sufficient information for database retrieval by spoken language. Our experimental system consists of three components: a speech recognizer based on HMM, a natural la...
Robust Text Processing in Automated Information Retrieval
We report on the results of a series of experiments with a prototype text retrieval system which uses relatively advanced natural language processing techniques in order to enhance the effectiveness of statistical document retrieval. In this paper we show that large-scale natural language processing (hundreds of millions of words and more) is not only required for a better retrieval, but it is ...
Information Retrieval Tasks
Techniques of automatic natural language processing have been under development since the earliest computing machines, and in recent years these techniques have proven to be robust, reliable and efficient enough to lead to commercial products in many areas. The applications include machine translation, natural language interfaces and the stylistic analysis of texts but NLP techniques have also ...
Linguistic Approaches to Text Management: An Appraisal of Progress
Since the earliest days of computing it has been a goal to automatically process natural language text with a view to identifying its content. Information retrieval has traditionally relied on keyword-based indexing to deliver retrieval for users in response to queries, a basis which is inadequate given the vagaries of natural language. In this article we examine why natural language processing...
Robust Text Processing and Information Retrieval
The general objective of this research has been the enhancement of traditional key-word based statistical methods of document retrieval with advanced natural language processing techniques. In the work to date the focus has been on obtaining a better representation of document contents by extracting representative phrases from syntactically preprocessed text and devising suitable weighting sche...
Publication date: 1992